Fundamentals of Missing Data in Evaluation

Presentation to MSU Department of Psychology, Program Evaluation Occasional Speaker Series, East Lansing, MI

Steven J. Pierce

Center for Statistical Training and Consulting

2024-12-05

Outline

  • What is missing data?
  • Why do we end up with missing data?
  • Why should we care about missing data?
  • How can we diagnose the missing data issues for a given study?
  • What should we do about missing data?

What is missing data?

Missing data (MD) are measurements you want or intended to collect but did not get.[1]

  • Having MD is common in research & evaluation studies.
  • If you do much evaluation work, you will run into MD.

Why do we end up with missing data?

Data collection doesn’t always go according to plan…

Human Factors:

  • Participant behavior
  • Evaluator errors
  • Partner behavior

Other Factors:

  • Equipment failures
  • Records/databases
  • Unusual events

Missing Data & Project Lifecycle

Diagram: MD can arise at every stage of the project lifecycle: Study Planning & Design → Data Collection → Data Entry → Data Storage & Management → Data Analysis.

Why should we care about missing data?

Ethics for Evaluators

Handling missing data well enacts our guiding principles[2]:


  • Systematic inquiry
  • Competence
  • Integrity

Scientific Activities[3]

There are 3 major scientific activities that can be affected by missing data.

  • Making structured observations of constructs.
  • Using observations to draw inferences about relationships between constructs.
  • Generalizing the results to populations beyond the collected sample.

Consequences for Measurement[3]

  • Reduced availability of constructs
  • Decreased reliability due to increased error variance
  • Bias from poor content coverage
  • Threats to construct validity

Consequences for Internal Validity[1,3]

  • Selection bias
  • Compromised randomization
  • Power and precision
  • Inaccurate model assumptions

Consequences for Generalizability[1,3]

A representative sample is crucial to generalizing to the intended population!

  • Theory development & cumulative knowledge
  • Policy & decision-making

How can we diagnose the missing data issues for a given study?

Cattell’s Data Box[4]


How much data is there? Data volume is \(N_{values} = P \times V \times T\), where \(P\) = persons, \(V\) = variables, and \(T\) = time points.

  • Slices through the cube represent subsets of data.
  • Constructs are often measured by groups of adjacent variables (items).
  • Missing data puts holes in your cube!
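As a quick illustrative sketch (the dimensions and missing-cell count below are made-up values, not from the talk), the data box volume and the share of "holes" can be computed directly:

```python
# Sketch: quantifying Cattell's data box and its "holes".
# P, V, T, and n_missing are hypothetical values for illustration.
P, V, T = 200, 15, 4           # persons, variables, time points
n_values = P * V * T           # total cells in the data box
n_missing = 1234               # hypothetical count of missing cells
pct_missing = 100 * n_missing / n_values
print(f"{n_values} cells, {pct_missing:.1f}% missing")
```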

Types of Missingness[3,5]

  • Item level
  • Construct level
  • Person level (unit non-response)
  • Person-period level (wave nonresponse; intermittent vs. dropout)

Describing the Amount of MD[3,6]

Report numbers & percentages of:

  • Participants w/ any data at each time point (retention/attrition)
  • Complete vs. incomplete cases (overall & by time point)
  • Missing values for each variable & construct
  • Reasons for attrition/dropout and other missing data
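The counts and percentages above can be tallied with a few lines of code. A minimal sketch using a hypothetical mini-dataset (all names and values below are invented for illustration):

```python
# Hypothetical mini-dataset: None marks a missing value.
rows = [
    {"id": 1, "age": 34,   "score_t1": 12,   "score_t2": 15},
    {"id": 2, "age": None, "score_t1": 9,    "score_t2": None},
    {"id": 3, "age": 41,   "score_t1": None, "score_t2": None},
    {"id": 4, "age": 29,   "score_t1": 14,   "score_t2": 11},
]
vars_ = ["age", "score_t1", "score_t2"]

# Missing count and percentage per variable
miss_counts = {v: sum(r[v] is None for r in rows) for v in vars_}
for v in vars_:
    print(f"{v}: {miss_counts[v]} missing ({100 * miss_counts[v] / len(rows):.0f}%)")

# Complete vs. incomplete cases overall
n_complete = sum(all(r[v] is not None for v in vars_) for r in rows)
print(f"complete cases: {n_complete} of {len(rows)}")
```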

Use a Good Tracking System

Track attendance at data collection events & participants’ exit from the study.

Patterns of Missing Data[5]

  • Y: matrix of all the values that could be observed
  • Y_obs: subset of Y values that end up observed
  • Y_miss: subset of Y values that end up missing
  • R: response matrix of dummy-coded missingness indicators showing which Y values are observed (0, part of Y_obs) vs. missing (1, part of Y_miss)

Tip

We can aggregate and visualize R to describe patterns of missingness!
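A small sketch of that idea, using invented data: build the response matrix R (0 = observed, 1 = missing, as defined above) and tally the distinct row patterns.

```python
from collections import Counter

# Hypothetical data matrix Y (rows = persons, columns = variables);
# None marks a missing value.
Y = [
    [12, 34, 56],
    [None, 30, 50],
    [None, 30, None],
    [11, None, 52],
    [None, 29, 48],
]

# R: 0 = observed (part of Y_obs), 1 = missing (part of Y_miss)
R = [[0 if v is not None else 1 for v in row] for row in Y]

# Aggregate rows of R to describe the missingness patterns
patterns = Counter(tuple(row) for row in R)
for pat, n in patterns.most_common():
    print(pat, n)
```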

Example Patterns of Missing Data

Missingness patterns for Dutch boys growth study data (748 boys, 9 variables, 1 time point)[7]

Rubin’s Mechanisms of Missingness[8]

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)

Impact on Statistical Results[3]

Some mechanisms yield more bias: MCAR < MAR < MNAR
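A toy simulation (not from the talk) can make the ordering concrete: generate y with a true mean of 0, delete values under each mechanism, and compare the bias of the observed-data mean.

```python
import random

# Toy simulation: bias of the observed-data mean under MCAR, MAR, and MNAR.
# The data-generating model is an arbitrary illustration.
random.seed(1)
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 0.8) for xi in x]   # true mean of y is 0

def observed_mean(keep):
    kept = [yi for yi, k in zip(y, keep) if k]
    return sum(kept) / len(kept)

mcar = observed_mean([random.random() < 0.5 for _ in range(n)])  # random deletion
mar  = observed_mean([xi < 0 for xi in x])   # deletion depends on observed x
mnar = observed_mean([yi < 0 for yi in y])   # deletion depends on y itself

print(f"|bias|: MCAR {abs(mcar):.3f} < MAR {abs(mar):.3f} < MNAR {abs(mnar):.3f}")
```

Each mean estimates 0; the absolute bias grows as the deletion rule moves from pure chance (MCAR) to depending on observed x (MAR) to depending on the unobserved y values themselves (MNAR).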

MCAR

MCAR is when neither observed nor unobserved values predict which values are missing.

Diagram: only random processes unrelated to Y predict missingness (R); neither Y_obs nor Y_miss predicts R.

MAR

MAR is when observed values predict which values are missing.

Diagram: observed values (Y_obs) and random processes unrelated to Y predict missingness (R); Y_miss does not.

MNAR

MNAR is when unobserved values predict which values are missing.

Diagram: observed values (Y_obs), unobserved values (Y_miss), and random processes unrelated to Y all predict missingness (R).

Rubin’s Mechanisms & Cattell’s Box[3,4,8]

Classifying large datasets according to Rubin’s mechanisms is messy.

  • Subsets of the data may fit into different mechanisms
  • Classify mechanisms for meaningful subsets of data defined by dimensions of the data box (persons, variables, times)[3]

Predictors of Missingness in Longitudinal Studies

  • Study arms and/or sites
  • Baseline/pretest values of outcome variables
  • Other covariates (demographics)

Tip

Consider predictors of person, item, construct, and person-period missingness. Think carefully about your study context and data to look for meaningful, sensible things to test when evaluating missing data issues.
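Before anything formal, a quick tabulation of dropout by study arm can flag arm-related missingness worth modeling. A sketch with hypothetical counts (all numbers invented):

```python
# Hypothetical descriptive check: compare dropout rates across study arms.
# A large difference suggests missingness related to arm (an MAR-style pattern).
arms = {"treatment": {"n": 120, "dropped": 18},
        "control":   {"n": 115, "dropped": 34}}
rates = {arm: 100 * d["dropped"] / d["n"] for arm, d in arms.items()}
for arm, rate in rates.items():
    print(f"{arm}: {rate:.1f}% dropout")
```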

What should we do about missing data?

Prevention: Data Collection Frequency & Timing[3]

  • Frequent data collection with short intervals between events is burdensome
  • Collect longitudinal data at sensible times given expected temporal patterns

Prevention: Number of Variables[3,9,10]

Every variable is an opportunity for missing data.

  • Only collect variables that you need
  • Choose instruments wisely (short, reliable, valid, & relevant)
  • Consider collecting multiple measures of key constructs

Prevention: Dropout[3,9–11]

  • Establish relationships
  • Maintain contact with participants
  • Offer adequate incentives
  • Make participation convenient

Prevention: Data Collection Methods

  • Develop detailed protocols
  • Train data collection & entry staff
  • Pay attention to instrument design
  • Test equipment & instruments

Treatment: Traditional Methods[3,12]


Data Deletion:

  • Listwise deletion
  • Pairwise deletion
  • Available items analysis

Single Imputation:

  • Mean substitution
  • Hot-deck imputation
  • Regression imputation
  • Last observation carried forward
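A toy demonstration (not from the talk) of why single imputation by mean substitution distorts results: filling holes with the observed mean shrinks the variance, which in turn understates standard errors.

```python
import random
from statistics import variance

# Toy demonstration: mean substitution shrinks variance.
# The data and the 25% deletion rule are arbitrary illustrations.
random.seed(2)
y = [random.gauss(50, 10) for _ in range(1_000)]
observed = [v for i, v in enumerate(y) if i % 4 != 0]   # drop 25% of cases
m = sum(observed) / len(observed)
filled = observed + [m] * (len(y) - len(observed))      # mean substitution

print(f"variance of observed values: {variance(observed):.1f}")
print(f"variance after mean substitution: {variance(filled):.1f}")
```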

Treatment: Modern Methods[3,12]

  • Full-information maximum likelihood (FIML) estimation
  • Multiple imputation (MI)
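A minimal MI sketch (illustrative only, not production code, and standard-error pooling is omitted for brevity): impute missing y from observed x via regression plus random noise, repeat m times, and pool the point estimates by averaging, as in Rubin's rules.

```python
import random

# Minimal multiple-imputation sketch under an MAR mechanism.
# Data-generating model and missingness rule are arbitrary illustrations.
random.seed(3)
n = 2_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]
miss = [xi > 0.5 for xi in x]              # MAR: missingness depends on observed x

xo = [xi for xi, mi in zip(x, miss) if not mi]
yo = [yi for yi, mi in zip(y, miss) if not mi]

# Least-squares fit of y on x using the observed cases only
mx, my = sum(xo) / len(xo), sum(yo) / len(yo)
b = sum((u - mx) * (w - my) for u, w in zip(xo, yo)) / sum((u - mx) ** 2 for u in xo)
a0 = my - b * mx
resid_sd = (sum((w - (a0 + b * u)) ** 2 for u, w in zip(xo, yo)) / (len(xo) - 2)) ** 0.5

# Create m imputed datasets, estimate the mean of y in each, then pool
m_imp = 20
means = []
for _ in range(m_imp):
    filled = [yi if not mi else a0 + b * xi + random.gauss(0, resid_sd)
              for xi, yi, mi in zip(x, y, miss)]
    means.append(sum(filled) / n)

pooled_mean = sum(means) / m_imp
print(f"pooled mean of y: {pooled_mean:.2f} (true mean is 1.0)")
```

Adding random noise to each imputation (rather than imputing the regression prediction alone) is what distinguishes MI from single regression imputation and preserves the variability in y.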

Treatment: FIML[3,12]

FIML estimates parameters by combining observed data, relationships among observed variables, and assumptions about distributions.

  • Seeks parameter estimates that best fit the data
  • Uses all available data without imputing any values
  • Benefits from auxiliary variables
  • Works well with MCAR and MAR
  • Yields biased estimates under MNAR
  • Available in structural equation modeling (SEM) software

Practical Options

  • Item-level missingness in scale scores[13,14]
  • Collaborate with a statistician!

Software Tools

  • SPSS: check out the MVA and MULTIPLE IMPUTATION syntax commands
  • R packages: see the CRAN Task View on Missing Data
  • Both FIML and MI are supported by Mplus and the R package lavaan

References

1. Fernández-García, M. P., Vallejo-Seco, G., Livácic-Rojas, P., & Tuero-Herrero, E. (2018). The (ir)responsibility of (under)estimating missing data. Frontiers in Psychology, 9(556). https://doi.org/10.3389/fpsyg.2018.00556
2. American Evaluation Association. (2018). Guiding principles for evaluators [Web Page]. Author. https://www.eval.org/About/Guiding-Principles
3. McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007). Missing data: A gentle introduction. Guilford Press.
4. Cattell, R. B. (1966). The data box: Its ordering of total resources in terms of possible relations systems. In R. B. Cattell (Ed.), Handbook of multivariate experimental psychology (pp. 67–128). Rand McNally.
5. Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
6. van Buuren, S. (2018). Flexible imputation of missing data (2nd ed.). Chapman & Hall/CRC Press. https://doi.org/10.1201/9780429492259
7. Fredriks, A. M., van Buuren, S., Burgmeijer, R. J. F., Meulmeester, J. F., Beuker, R. J., Brugman, E., Roede, M. J., Verloove-Vanhorick, S. P., & Wit, J.-M. (2000). Continuing positive secular growth change in the Netherlands 1955–1997. Pediatric Research, 47, 316–323. https://doi.org/10.1203/00006450-200003000-00006
8. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.1093/biomet/63.3.581
9. de Leeuw, E. D. (2001). Reducing missing data in surveys: An overview of methods. Quality & Quantity, 35(2), 147–160. https://doi.org/10.1023/A:1010395805406
10. Wisniewski, S. R., Leon, A. C., Otto, M. W., & Trivedi, M. H. (2006). Prevention of missing data in clinical research studies. Biological Psychiatry, 59, 997–1000. https://doi.org/10.1016/j.biopsych.2006.01.017
11. Laurie, H. (2008). Minimizing panel attrition. In S. Menard (Ed.), Handbook of longitudinal research (pp. 167–184). Academic Press.
12. Enders, C. K. (2022). Applied missing data analysis (2nd ed.). Guilford Press. https://www.appliedmissingdata.com
13. Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576. https://doi.org/10.1146/annurev.psych.58.110405.085530
14. Newman, D. A. (2014). Missing data: Five practical guidelines. Organizational Research Methods, 17(4), 372–411. https://doi.org/10.1177/1094428114548590